feat(plugins-google): add cached_content option for explicit context caching #5675
Open
kamil-bidus wants to merge 1 commit into livekit:main from
Conversation
The plugin currently relies on Gemini's implicit cache, which is heuristic. In voice-agent workloads where the system prompt is large and stable across calls, implicit caching often misses on turn 1 of a conversation, paying the full cold-start cost. Explicit caching is the documented alternative: the application creates a CachedContent resource via client.caches.create(...) and references it by name on subsequent generateContent calls. Cached prefix tokens are billed at a discount and processed in under 100ms.

The plugin already reads cached_content_token_count from response usage but had no way to set cached_content on requests. This adds the parameter on LLM.__init__, stores it on _LLMOptions, and propagates it into GenerateContentConfig via extra_kwargs.

End-to-end usability matters: Gemini rejects generateContent requests that pass cached_content together with system_instruction, tools, or tool_config — those fields belong inside the CachedContent resource. Without handling that, setting cached_content on any LLM that also has a system prompt or function tools would 400. So LLMStream._run now suppresses system_instruction, tools, and tool_config from the outgoing request whenever cached_content is attached.

Cache lifecycle (creation, TTL refresh, deletion) and the choice of what to bake into the cache stay the application's responsibility — the plugin only consumes the resource name and ensures the matching fields are absent from the request. Behaviour is unchanged for callers that don't pass cached_content: the gating is strictly is-given on that one option. Documented in the docstring so users know the cache must contain whichever of system_instruction / tools the model needs.

Tests cover propagation, the omitted-when-not-set default, and the three suppression branches (system_instruction stripped, tools stripped, tool_config stripped), plus the unchanged-when-no-cache backward-compat path.

Refs livekit#2359.
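For context, a sketch of the application-side flow this enables, using the google-genai SDK (model name, display name, and TTL are chosen for illustration):

```python
from google import genai
from google.genai import types
from livekit.plugins import google

SYSTEM_PROMPT = "..."  # the large, stable prefix worth caching

client = genai.Client()  # Developer API; use Client(vertexai=True, ...) on Vertex
cache = client.caches.create(
    model="gemini-2.0-flash-001",  # illustrative; must match the LLM's model
    config=types.CreateCachedContentConfig(
        display_name="voice-agent-system-prompt",
        system_instruction=SYSTEM_PROMPT,
        ttl="3600s",  # refresh/deletion stays the application's responsibility
    ),
)

# New in this PR: reference the cache by name; the plugin then omits
# system_instruction/tools/tool_config from its outgoing requests.
llm = google.LLM(model="gemini-2.0-flash-001", cached_content=cache.name)
```

Note that Gemini enforces a model-dependent minimum token count before content is cacheable, so very short system prompts may not qualify.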
kamil-bidus force-pushed from c57dd80 to 657894a
Motivation
The Gemini plugin's `LLM` class supports many `GenerateContentConfig` options (`thinking_config`, `retrieval_config`, `safety_settings`, etc.) but not `cached_content`. The plugin already reads `cached_content_token_count` from response usage in `LLMStream._parse_part`, so cache hits surface in metrics — there's just no way to attach a `CachedContent` resource to outgoing requests.
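For reference, where that count lives in google-genai terms (a direct SDK call shown for illustration; inside the agent the plugin surfaces it through its usage metrics):

```python
from google import genai

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.0-flash-001",  # illustrative model name
    contents="hello",
)
# Non-zero when the request hit an explicit or implicit cache.
print(resp.usage_metadata.cached_content_token_count)
```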
Change

Add `cached_content: NotGivenOr[str] = NOT_GIVEN` to `LLM.__init__`, propagated through `_LLMOptions` → `chat()` → `extra["cached_content"]` → `GenerateContentConfig` via `**self._extra_kwargs`.
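A minimal sketch of that plumbing, using the names from this description but simplified (`build_extra_kwargs` is a hypothetical helper; in the plugin the collection happens inside `chat()`):

```python
from dataclasses import dataclass

from livekit.agents.types import NOT_GIVEN, NotGivenOr
from livekit.agents.utils import is_given


@dataclass
class _LLMOptions:
    model: str
    cached_content: NotGivenOr[str] = NOT_GIVEN  # new in this PR


def build_extra_kwargs(opts: _LLMOptions) -> dict:
    # Only explicitly-given options land in the dict that is later
    # splatted into GenerateContentConfig(**self._extra_kwargs).
    extra: dict = {}
    if is_given(opts.cached_content):
        extra["cached_content"] = opts.cached_content
    return extra
```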
Request-side suppression

Gemini's API rejects `generateContent` requests that pass `cached_content` together with `system_instruction`, `tools`, or `tool_config` — those fields belong inside the `CachedContent` resource itself, and the API returns a 400 instructing callers to move them.

Without handling that, the parameter would 400 on any realistic agent. So `LLMStream._run` strips `system_instruction`, `tools`, and `tool_config` from the outgoing request whenever `cached_content` is attached. Behaviour is unchanged when `cached_content` is unset.

Cache lifecycle (creation, TTL refresh, deletion) and the choice of what to bake into the cache stay the application's responsibility.
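A sketch of the suppression branch, assuming the request config is assembled as a google-genai `GenerateContentConfig` (the field names are the real SDK ones; `_apply_cached_content` is a hypothetical helper standing in for inline code in `LLMStream._run`):

```python
from google.genai import types

from livekit.agents.utils import is_given


def _apply_cached_content(config: types.GenerateContentConfig, cached_content) -> None:
    if not is_given(cached_content):
        return  # no cache attached: request goes out unchanged
    config.cached_content = cached_content
    # These three must live inside the CachedContent resource; sending
    # them alongside cached_content makes Gemini return a 400.
    config.system_instruction = None
    config.tools = None
    config.tool_config = None
```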
Compatibility
Default `NOT_GIVEN` keeps existing behaviour unchanged — verified by tests covering both the omission case (no key in `_extra_kwargs`) and the no-cache request path (`system_instruction` and `tools` propagate as before).

Works with both the Gemini Developer API (`cachedContents/{id}`) and Vertex AI (`projects/{p}/locations/{l}/cachedContents/{id}`); the plugin passes the string through unmodified.
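For illustration, the two accepted shapes (IDs, project, and location are placeholders):

```python
# Gemini Developer API: the short name returned by client.caches.create(...)
cached_content = "cachedContents/abc123"

# Vertex AI: the fully qualified resource name
cached_content = "projects/my-project/locations/us-central1/cachedContents/abc123"
```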
Tests

`tests/test_plugin_google_llm.py` — 6 cases:

- `cached_content` round-trips through `_LLMOptions` and reaches `_extra_kwargs`; the default `NOT_GIVEN` produces no key.
- Patching `client.aio.models.generate_content_stream` to capture the `GenerateContentConfig`, the request omits `system_instruction`/`tools`/`tool_config` when `cached_content` is set and includes them when it isn't (backward compat).

Existing google-plugin tests still pass.
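A sketch of the round-trip cases, asserting against the private attributes named above (`_opts` holding `_LLMOptions`, and `_extra_kwargs`); the real tests may assert via the captured `GenerateContentConfig` instead:

```python
from livekit.plugins import google


def test_cached_content_round_trip() -> None:
    llm = google.LLM(api_key="test", cached_content="cachedContents/abc123")
    assert llm._opts.cached_content == "cachedContents/abc123"
    assert llm._extra_kwargs["cached_content"] == "cachedContents/abc123"


def test_default_omits_key() -> None:
    # Default NOT_GIVEN: no key should appear in the kwargs splatted
    # into GenerateContentConfig.
    llm = google.LLM(api_key="test")
    assert "cached_content" not in llm._extra_kwargs
```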
`ruff check` / `ruff format` clean.

Refs livekit#2359.